In this work, we propose an approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, appearance and motion detection networks are employed to localise and score actions from colour images and optical flow. In stage 2, the appearance network detections are boosted by combining them with the motion detection scores, in proportion to their respective spatial overlap. In stage 3, sequences of detection boxes most likely to be associated with a single action instance, called action tubes, are constructed by solving two energy maximisation problems via dynamic programming. In the first pass, action paths spanning the whole video are built by linking detection boxes over time using their class-specific scores and their spatial overlap; in the second pass, temporal trimming is performed by enforcing label consistency across all constituent detection boxes. We demonstrate the performance of our algorithm on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly increasing detection speed at test time.
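To make the first-pass linking step concrete, the sketch below shows one plausible Viterbi-style dynamic program that chains per-frame detection boxes into an action path by maximising the sum of class-specific scores plus a spatial-overlap consistency term. The function names, the input format, and the trade-off weight `lam` are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def link_action_path(frames, lam=1.0):
    """frames: list over time; each element is a list of (box, class_score) pairs.

    Returns one detection index per frame, picking the sequence that maximises
    sum_t score(b_t) + lam * sum_t IoU(b_t, b_{t+1})  via dynamic programming.
    """
    T = len(frames)
    # Accumulated energy of the best path ending at each detection of frame t.
    energy = [np.array([s for _, s in frames[0]], dtype=float)]
    backptr = []
    for t in range(1, T):
        scores = np.array([s for _, s in frames[t]], dtype=float)
        prev_boxes = [b for b, _ in frames[t - 1]]
        cur_boxes = [b for b, _ in frames[t]]
        e = np.empty(len(cur_boxes))
        bp = np.empty(len(cur_boxes), dtype=int)
        for j, bj in enumerate(cur_boxes):
            # Best predecessor under score + overlap-consistency energy.
            trans = np.array([energy[-1][i] + lam * iou(bi, bj)
                              for i, bi in enumerate(prev_boxes)])
            bp[j] = int(trans.argmax())
            e[j] = scores[j] + trans[bp[j]]
        energy.append(e)
        backptr.append(bp)
    # Backtrack the highest-energy path from the last frame.
    path = [int(energy[-1].argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

A second, analogous dynamic program over the per-box class labels along the resulting path would then perform the temporal trimming described above, retaining only the frames whose labels agree with the tube's action class.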